Modeling prosodic dynamics for speaker recognition
Abstract
Most current state-of-the-art automatic speaker recognition systems extract speaker-dependent features from short-term spectral information. This approach ignores long-term information that conveys supra-segmental cues, such as prosody and speaking style. We propose two approaches that use the fundamental frequency and energy trajectories to capture long-term information. The first approach uses bigram models to model the dynamics of the fundamental frequency and energy trajectories for each speaker. The second approach uses the fundamental frequency trajectories of a pre-defined set of words as speaker templates and then uses dynamic time warping to compute the distance between the templates and the words from the test message. The results presented in this work are on Switchboard I using the NIST Extended Data evaluation design. We show that these approaches can achieve an equal error rate of 3.7%, which is a 77% relative improvement over a system based on short-term pitch and energy features alone.
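The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the rise/fall/steady quantization, the threshold `eps`, the add-alpha smoothing, and all function names are assumptions made for illustration, and only fundamental frequency (not energy) is modeled here. The first part trains a bigram model over symbols describing how a pitch contour moves; the second computes a dynamic time warping distance between two pitch contours, as one might between a word template and a test word.

```python
import math
from collections import Counter

ALPHABET = ("R", "F", "S")  # rise, fall, steady (hypothetical symbol set)

def slope_symbols(f0, eps=1.0):
    """Quantize a pitch contour (Hz per frame) into rise/fall/steady symbols.

    The threshold eps is an illustrative assumption, not a value from the paper.
    """
    symbols = []
    for prev, cur in zip(f0, f0[1:]):
        delta = cur - prev
        symbols.append("R" if delta > eps else "F" if delta < -eps else "S")
    return symbols

def train_bigram(symbols, alpha=1.0):
    """Estimate add-alpha smoothed bigram probabilities P(b | a)."""
    pair_counts = Counter(zip(symbols, symbols[1:]))
    context_counts = Counter(symbols[:-1])
    return {
        (a, b): (pair_counts[(a, b)] + alpha)
        / (context_counts[a] + alpha * len(ALPHABET))
        for a in ALPHABET
        for b in ALPHABET
    }

def bigram_loglik(model, symbols):
    """Log-likelihood of a symbol sequence under a bigram model."""
    return sum(math.log(model[(a, b)]) for a, b in zip(symbols, symbols[1:]))

def dtw_distance(x, y):
    """Classic dynamic time warping distance with absolute-difference local cost."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]

if __name__ == "__main__":
    # Toy contours; real systems would use pitch tracks from a pitch extractor.
    speaker_model = train_bigram(list("RRRRFRRRR"))
    print(bigram_loglik(speaker_model, list("RRRR")))
    print(dtw_distance([100.0, 110.0, 120.0], [100.0, 110.0, 110.0, 120.0]))
```

In a verification setting, the bigram log-likelihood of a test contour under a speaker's model would typically be compared against a background model, and the DTW distance would be averaged over the matched words; both scoring choices are assumptions here rather than details given in the abstract.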
Similar resources
Incorporating Prosodic with Acoustic information for ISCSLP’2006 Speaker Recognition Evaluation - Robust Cross-Channel Speaker Verification
In this paper, we present our speaker verification (SV) systems for the cross-channel text-independent and text-dependent speaker verification (TI-SV and TD-SV) tasks of the ISCSLP’2006 speaker recognition evaluation (ISCSLP2006-SRE). To address the cross-channel issues and take advantage of the unique characteristics of Mandarin (i.e., a tonal language), prosodic contours are modeled to assist the state-...
Duration and pronunciation conditioned lexical modeling for speaker verification
We propose a method to improve speaker recognition lexical model performance using acoustic-prosodic information. More specifically, the lexical model is trained using duration- and pronunciation-conditioned word N-grams, simultaneously modeling lexical information along with their acoustic and prosodic characteristics. Support vector machines are used for modeling and scoring, with N-gram freque...
SVM modeling of "SNERF-grams" for speaker recognition
We describe a new approach to modeling idiosyncratic prosodic behavior for automatic speaker recognition. The approach computes prosodic features by syllable (syllable-based nonuniform extraction region features, or “SNERFs”), and models the syllable-feature sequences (“SNERF-grams”) using support vector machines (SVMs). We evaluate performance on development data for a system submitted to the N...
Speaker recognition using the resynthesized speech via spectrum modeling
Recently, the use of prosodic information such as pitch and energy for speaker recognition has attracted much attention. However, these kinds of systems perform much worse than traditional cepstral-based systems, and only limited improvement can be achieved by combining the two kinds of systems. In this paper, we present a new approach for speaker recognition, which uses the proso...
Pitch-dependent GMMs for text-independent speaker recognition systems
Gaussian mixture models (GMMs) and ergodic hidden Markov models (HMMs) have been successfully applied to model short-term acoustic vectors for speaker recognition systems. Prosodic features are known to carry information about the speaker’s identity, and they can be combined with the short-term acoustic vectors to increase the performance of the speaker recognition system. In this ...
Prosodic features for speaker verification
In this paper, we study the effectiveness of prosodic features for speaker verification. We hypothesize that prosody is linked to linguistic units such as syllables and that prosodic features can be better represented with reference to the syllabic sequence. For extracting prosodic features, speech is segmented into syllable-like regions using the knowledge of vowel onset points (VOP). We use a techni...